-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic calculation of a polygon area #1534
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly ok, some comments:
- The templates need to CSS love; @clash99 is picking that up?!
- I think, there should be a database migration that calculates the areas for all existing spatial units.
- The
geometry_details
column is not 100% clear what this is. Could we rename it toarea
? I'm thinking that in the future, when we have length available as well, that there should be two columns in the export,area
andlength
, and we just leave the one empty that doesn't apply.
@@ -0,0 +1,26 @@ | |||
# -*- coding: utf-8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually rename migration files to something readable so we know what it's about when we look at it in the migrations table; something like 0003_add_geometry_field.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the implementation to be based on a PostGIS query, since the geometry is store and postgres has a built in area function there is no need to store it via a field anymore.
cadasta/spatial/models.py
Outdated
@receiver(models.signals.pre_save, sender=SpatialUnit) | ||
def define_geometry_details(sender, instance, **kwargs): | ||
geom = instance.geometry | ||
from django.contrib.gis.geos.polygon import Polygon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move the import to the top to the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been removed now all together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR has a schema migration to add a geometry_details
field and tooling to add that data when a SpatialUnit
is created or changed, however this PR doesn't address the SpatialUnit
instances that already exist in the database. geometry_details
values should be generated for the existing data via a data migration.
Additionally, I see that the area is stored as a string. Is there any advantage to this? Keeping it as a float
will keep querying options more open for future needs.
In [1]: su = SpatialUnit.objects.first()
In [2]: su.geometry_details
In [3]: su.save()
In [4]: su.geometry_details
Out[4]: {'area': '413525.04'}
Ultimately, I question if there is actually a need to store a geometry's area as another field (or a portion of another field). This data can be queried from the DB and will be gauranteed to be up-to-date. It feels like storing data in the DB that is derived from other data in the DB is an uphill battle, keeping the values in-sync always feels like a chore and is best avoided if possible.
It seems like we have two needs: 1) get area for single SpatialUnit, 2) get area for many SpatialUnit instances.
Getting the area for a single unit could be a property on the model returning self.geometry.transform(3857, clone=True).area
and getting the area from multiple instances can be done via aggregation/annotation:
from django.contrib.gis.db.models.functions import Area, Transform
from django.db.models import Sum
# Get all SpatialUnits with a geometry
qs = SpatialUnit.objects.exclude(geometry=None)
# Add a field to each SpatialUnit with the area of its transformed geometry
qs = qs.annotate(area=Area(Transform('geometry', 3857)))
# Get sum of all areas
area = qs.aggregate(Sum('area'))
print(area['area__sum'].sq_m)
PostgreSQL and PostGIS are optimized for these types of operations and will likely be faster than doing them in Python.
Based on @alukach feedback I think we can remove the geometry_details from the model and simple handle this in the context with PostgreSQL and PostGIS. Which will eliminate @oliverroick concern about the naming convention of geometry_details and area... Also the back processing will not need to take place since nothing will be changing to the SpatialUnit once migrations is removed.. @alukach & @oliverroick I will make appropriate changes.. Thanks for your feedback!!! |
@oliverroick - yes, let me know once this is approved and I will need to update the project dashboard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks really good. There looks like there's one possible bug in the code.
Style thought: What you have (calling location.geometry.transform(3857, clone=True).area
in a few places) totally works, however it may make your life easier to just add a @property
for area
on the SpatialUnit
model. That way you would only need to write self.geometry.transform(3857, clone=True).area
once and handle situations where location.geometry == None
in one location.
cadasta/spatial/views/default.py
Outdated
@@ -116,6 +116,7 @@ def get_context_data(self, *args, **kwargs): | |||
pass | |||
|
|||
location = context['location'] | |||
context['area'] = location.geometry.transform(3857, clone=True).area |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible that a location
could not have a geometry
value. I believe this will throw some errors in that situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a property property name area to the SpatialUnit to handle when location.geometry == None. Also the property as you mentioned allowed the removed of redundant geometry.transform(3857, clone=True).area
cadasta/core/templatetags/filters.py
Outdated
else: | ||
ac = area * 0.00024711 | ||
formated_area = format(ac, '.2f') + ' ac' | ||
return formated_area |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style thought: I believe that these filters can be reduced in size a bit. No need to declare an empty string on the first line.Rather than formated_area = format(...
, you can simply return
the values from those lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this to simply be the returns and removed unnecessary formated_area variable.
@alukach @oliverroick Thanks for all your feedback!!!, I have made the changes requested. Let me know if you see anything else that needs to be fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good; thanks, Jon!
@clash99 So the PR is good to go. I think we should handle the changes to the dashboard this way:
That way this and the #1537 are not dependent on each other and can be merged any time when they are ready. |
We need to hold off on merging this PR; the solution as-is won't scale well and will introduce a hit to performance/page load times. Location geometries infrequently (if ever) change, so their area calculations should absolutely be stored in the database table alongside the geometry. Re-calculating the area every time the value is accessed is inefficient and unnecessary. It's trivial to ensure that a location's area stays updated in the event of a change to its geometry -- use a db trigger to detect changes to the geometry and automatically recalculate and store its area. This lets us leverage the efficiencies of calculation in PostGIS (rather than Python), eliminates any issues around keeping the area value up to date, and makes accessing the area value a simple read op (which means no negative performance impact). This is a beneficial optimization in the context of displaying a single location's area on the location detail page. However, it becomes essential in the context of our other intended use for area: displaying the cumulative area of all locations in a project on the project dashboard. In this PR's implementation, that means every single time a user views their project dashboard page, we will be querying the database for each location, calculating the area of each location, adding all those areas, and displaying the total to the user. This value doesn't get stored or cached anywhere, so every dashboard page load and every refresh for every user will be repeating those steps. @oliverroick mentioned that testing this in his local dev VM with 60k locations took ~600ms. That's with a local database, local platform server, no latencies, no other traffic, one user at a time, no other database transactions, etc. In a live production environment, this implementation will negatively and unnecessarily impact performance. Our page load times are already cumbersome on slow connections, and there's no reason to calculate these values on access. To implement this properly, we should: calculate and re-calculate the area via trigger whenever a location's geometry is created or updated, store the area value in the database, calculate and re-calculate a project's cumulative area via trigger whenever any of its location geometries are created/updated, and store that area value in the database. Every access of location or project area is a read. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments in PR body
@amplifi Storing the area via the database was how I originally implemented this. Thank goodness for version control. Ill re-implement with storing the area in the database. |
@amplifi thoughts on storing in 1 measurement. Originally was implemented where it was stored in 4 different measurements. I think that calculation would be less of an issue. Thoughts? cc @jnordling |
@jnordling I think I've come around to agreeing with @amplifi. I was going to make the case that we should avoid storing redundant data in the DB as it felt like a premature optimization, that the queries aren't really that slow, and that we should instead lean on caching views (which would have greater performance gains than caching data in the DB as template rendering is likely our biggest cost). However, testing the area calculation for 800k locations was like 10s, which is obviously to long to run even intermittently. DB triggers are a great way to ensure that an area is calculated when a location is uploaded or changed (I'm not familiar enough to say how easy/difficult it would be to use triggers for summing the areas of all locations in a projects), however I think the big downside to this is that managing triggers isn't particularly well supported in Django (we could lean on data migrations for this) and it divides logic between the Django system and Python which makes the system a bit more complex. Signals (as were used in the initial implementation) are a good way to avoid this but can easily miss changes (such as when using So yeah, even despite the complexities, I think DB triggers are the way to go. A DB trigger that would only calculate area for a few polygons that were saved and update the project with a sum does seem like the best performance/effort ratio. Remember that DB triggers do block the transaction, however I think this is acceptable being that, as said by @amplifi, reads are far more common that writes. Regarding @wonderchook's comment, I agree that storing them in a single unit (I recommend sq meters) and converting in Python seems like the best way to keep things simple. |
@alukach My understanding of the signal pre_save is that it works for update and create or any save for that matter. Is this not right? We are already doing this for the check_extent when SpatialUnit are updated/created or saved. For bulk_create.. What about implementing a bulk_create override method for the SpatialUnit model since signals do not work in that instance. Just thoughts, I also don't see any bulk_create for SpatialUnit (but I understand wanting to plan for it)... I think triggers sounds nice in theory but as you mentioned, introducing a new level of complexity. Anyways just thoughts.. Ill start researching DB triggers for this. |
Admittedly, I might be acting overly picky about this. Perhaps signals are good enough for now (not sure how often anyone would actually use |
@alukach Im referencing, seems not terrible. http://eflorenzano.com/blog/2008/11/04/database-triggers-arent-evil-and-they-actually-kin/.. Im going to give it a go and see what happens... :) |
Hey @jnordling, is there any progress on this? Do you need anything from our end? |
Hey @oliverroick . I think I can finishing this up in the coming days as long a django signals are ok to use? |
Signals should be ok, we're using them elsewhere too. |
Updated this PR to uses signals, removed the UI component so there will be no conflict with Add user measurement system #1610 or the new dashboard UI.. Also added the data migration. I think this PR needs to be tag for migration?.. Any ways have a look. @oliverroick. Let me know if it needs anything else. |
if item.area: | ||
value = item.area | ||
else: | ||
value = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by this, what's the purpose of this if item.area
check? Additionally, why aren't we using value.area
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this. It was unnecessary.
geom = sp.geometry | ||
if geom and isinstance(geom, Polygon) and geom.valid: | ||
sp.area = geom.transform(3857, clone=True).area | ||
sp.save() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we could do this more performantly with F
expressions, pushing the calculation to the DB:
from django.contrib.gis.db.models.functions import Area, Transform
from django.db.models import F
SpatialUnit.objects.exclude(geometry=None).update(area=Area(Transform('geometry', 3857)))
My main concern is that SpatialUnit.objects.all()
does not chunk the results at all by default, so this will request every field of every row from the SpatialUnit
table at once which may bring up memory concerns on the server in production.
However, given my track record on this PR, you may want to take that with a grain of salt 😉 . I'll let @amplifi comment on whether she thinks doing the calculation for all SpatialUnit
instances with geometries in a single transaction will be too much for the DB (I'm not familiar with the # of rows in the prod DB or the DB server's capabilities). If this is deemed to be an issue, I'd recommend still using .update()
with an F
expression but on batches of the queryset (you can slice the queryset to generate these batches).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the migration with your suggestion to use the F
expressions. Also added so it would only preform on Polygons.
SpatialUnit.objects.exclude(geometry=None).extra(
where=["geometrytype(geometry) LIKE 'POLYGON'"]).update(
area=Area(Transform('geometry', 3857)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
…al to calculate geometry details, m2,ft2,ha,ac
…that areas are only calculated for polygons
…ult context for display in location_details
…for both project dashboard and location details
…postgis query and project sum, updated tests and linting
…t code, made code style changes and added updated test based on new property
cad07f9
to
51e0184
Compare
Hi! Based on my investigation, I think the method used to compute the area of polygons is very wrong. See the bug I filed at #1689. |
Proposed changes in this pull request
To automatic calculation of a polygon area (for project locations). This was added by creating a geometry_details to the SpatialUnit model, which then the area is calculated durning creation/edit of a SpatialUnit. The view handles the conversion of the units. The area is calculated and stored in meters squared. There are two filter functions which can be used in the view to convert. On the Location Detail view if Polygon type the area will be displayed. On the Project Dashboard there is the sum of the total area for that project. The geometry details has also been added to the data export for the project.
The issues and requirements for this PR is #811
When should this PR be merged
Soon, preferably
Risks
None I can think of.
Follow-up actions
I connected with @clash99 about the UI. She will be making adjustments to the style of how they are represented in the new project dashboard view in sprints to come
Checklist (for reviewing)
General
migration
label if a new migration is added.Functionality
Code
Tests
Security
Documentation